Refactor scale-down for better integration with drainability rules #6135
Conversation
Welcome @artemvmin!
/assign
Force-pushed fe5cf63 to aba1391
Force-pushed c162820 to e53147e
/cc @olagacek
@x13n: GitHub didn't allow me to request PR reviews from the following users: olagacek. Note that only kubernetes members and repo collaborators can review this PR, and authors cannot review their own PRs. In response to this: /cc @olagacek
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
@@ -106,39 +87,6 @@ func GetPodsToMove(nodeInfo *schedulerframework.NodeInfo, deleteOptions NodeDele
	if err != nil {
		return pods, daemonSetPods, blockingPod, err
	}
	if pdbBlockingPod, err := checkPdbs(pods, pdbs); err != nil {
By moving this check before GetPodsForDeletionOnNodeDrain, you're changing the logic - it will now operate on a different set of pods. In particular, it will start to check PDBs for DS pods, which I think doesn't make sense - we don't want to block node removal on this. I think rewriting GetPodsForDeletionOnNodeDrain into drainability rules first (as mentioned in the TODO above) would be a safer approach, since then you could preserve the ordering.
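To make the ordering concern concrete, here is a purely illustrative Go sketch (an editorial addition, not autoscaler code; every type and function name in it is invented). When DaemonSet pods are filtered out before the PDB check, a PDB covering only a DS pod never blocks removal; running the check on the unfiltered list makes it block.

```go
package main

import "fmt"

// pod and pdb are deliberately minimal stand-ins; the real code uses the
// Kubernetes API types and label selectors.
type pod struct {
	name        string
	isDaemonSet bool
	app         string
}

type pdb struct {
	app string // blocks eviction of pods with this app label
}

// firstBlockedPod returns the first pod covered by any PDB, mimicking the
// "blocking pod" result of a PDB check.
func firstBlockedPod(pods []pod, pdbs []pdb) *pod {
	for i := range pods {
		for _, b := range pdbs {
			if pods[i].app == b.app {
				return &pods[i]
			}
		}
	}
	return nil
}

func main() {
	all := []pod{
		{name: "log-agent", isDaemonSet: true, app: "logger"},
		{name: "web", isDaemonSet: false, app: "web"},
	}
	pdbs := []pdb{{app: "logger"}}

	// Original order: DaemonSet pods are dropped before the PDB check, so the
	// logger PDB never blocks node removal.
	var drainable []pod
	for _, p := range all {
		if !p.isDaemonSet {
			drainable = append(drainable, p)
		}
	}
	fmt.Println("PDB check after DS filtering:", firstBlockedPod(drainable, pdbs)) // <nil>

	// Reordered check: the DaemonSet pod is still in the list, so its PDB now
	// blocks removal - the behavior change described above.
	fmt.Println("PDB check before DS filtering:", firstBlockedPod(all, pdbs)) // &{log-agent true logger}
}
```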
I became aware of this ordering issue early on and decided to turn this PR into the full story. I reordered the commits and updated the title. I'll ping for a follow-up review when the remainder is implemented.
	// require adding information to the DrainContext, such as the slice of pods
	// and a flag to prevent duplicate checks.
	for _, pdb := range drainCtx.Pdbs {
		selector, err := metav1.LabelSelectorAsSelector(pdb.Spec.Selector)
With this implementation we're doing the same conversion O(N*M) times instead of O(N) as before. (Where N is the number of PDBs and M is the number of pods.) Could we keep selectors, rather than just raw pdbs, in the context? Or - even better - reuse RemainingPdbTracker, which already operates in that way?
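For illustration only, a sketch of the first suggestion (precomputing each selector once per PDB) using the standard apimachinery helpers. The precomputedPdb type and both functions are invented for this sketch and are not part of the PR; the PR ultimately reuses RemainingPdbTracker instead.

```go
package pdbcheck

import (
	apiv1 "k8s.io/api/core/v1"
	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/labels"
)

// precomputedPdb pairs a PDB with its parsed selector so the conversion runs
// once per PDB (O(N)) instead of once per pod-PDB pair (O(N*M)).
type precomputedPdb struct {
	pdb      *policyv1.PodDisruptionBudget
	selector labels.Selector
}

// precompute converts each PDB's label selector exactly once.
func precompute(pdbs []*policyv1.PodDisruptionBudget) ([]precomputedPdb, error) {
	out := make([]precomputedPdb, 0, len(pdbs))
	for _, pdb := range pdbs {
		sel, err := metav1.LabelSelectorAsSelector(pdb.Spec.Selector)
		if err != nil {
			return nil, err
		}
		out = append(out, precomputedPdb{pdb: pdb, selector: sel})
	}
	return out, nil
}

// matchingPdbs reuses the parsed selectors for every pod it is asked about.
func matchingPdbs(pod *apiv1.Pod, pdbs []precomputedPdb) []*policyv1.PodDisruptionBudget {
	var matched []*policyv1.PodDisruptionBudget
	for _, p := range pdbs {
		if p.pdb.Namespace == pod.Namespace && p.selector.Matches(labels.Set(pod.Labels)) {
			matched = append(matched, p.pdb)
		}
	}
	return matched
}
```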
I had a separate draft passing RemainingPdbTracker directly, but without using DrainCtx. The combination of these two ideas seems to be the sweet spot. I initially didn't follow through with it because I got scared of the asynchronous goroutines using the NodesToRemove function (https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/scaledown/actuation/actuator.go#L219) and the lack of multi-thread safety in the RemainingPdbTracker object. Do you think this is an issue?
I added the refactor to this PR.
Ah, good question. As far as I can tell though, you're using a new instance of the tracker in each such goroutine. This is perhaps suboptimal, but should be safe.
...r-autoscaler/processors/scaledowncandidates/emptycandidates/empty_candidates_sorting_test.go (outdated review comment, resolved)
Force-pushed 84ad488 to 98effe0
The current refactor looks good to me now. Are you sure you want to rewrite all the checks in one humongous PR? You may start getting into merge conflicts, so I'd suggest following up in a separate PR - WDYT? Btw, in the release notes you're adding some actions required - while this is true for downstream/forked code, it may be confusing for OSS CA release notes (which is the intended audience for these). I don't think this change should have any user-visible changes.
Sounds good. Thanks for the tip. I updated the PR title.
That makes sense. Updated. Please review.
Force-pushed 98effe0 to 8f2532a
Force-pushed 8f2532a to 5a2f46e
@@ -106,7 +90,7 @@ func GetPodsToMove(nodeInfo *schedulerframework.NodeInfo, deleteOptions NodeDele
 	if err != nil {
 		return pods, daemonSetPods, blockingPod, err
 	}
-	if pdbBlockingPod, err := checkPdbs(pods, pdbs); err != nil {
+	if pdbBlockingPod, err := checkPdbs(pods, remainingPdbTracker.GetPdbs()); err != nil {
It seems you could already delete the checkPdbs function now and just use RemainingPdbTracker.CanRemovePods().
Done. I wasn't sure what the "legacy scale-down" was referring to in the comment below, so I left it alone. The logic looks identical if the parallel return value is ignored.
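As an editorial illustration, a rough sketch of that swap. The three-value CanRemovePods signature (a can-remove flag, a parallel flag that is ignored here, and the blocking pod) is assumed from the discussion above; the local pdbTracker interface and checkPdbsViaTracker helper are invented for this sketch, and the real interface lives in the scale-down pdb package.

```go
package simulator

import (
	"fmt"

	apiv1 "k8s.io/api/core/v1"
	"k8s.io/autoscaler/cluster-autoscaler/utils/drain"
)

// pdbTracker captures only the method used here; the signature is an
// assumption based on the review thread, not a copy of the real interface.
type pdbTracker interface {
	CanRemovePods(pods []*apiv1.Pod) (canRemove, inParallel bool, blockingPod *drain.BlockingPod)
}

// checkPdbsViaTracker mirrors the old checkPdbs contract (a blocking pod plus
// an error) while delegating the PDB accounting to the tracker. The
// "in parallel" return value is deliberately ignored, as noted above.
func checkPdbsViaTracker(tracker pdbTracker, pods []*apiv1.Pod) (*drain.BlockingPod, error) {
	canRemove, _, blockingPod := tracker.CanRemovePods(pods)
	if !canRemove {
		return blockingPod, fmt.Errorf("not enough pod disruption budget to remove pods")
	}
	return nil, nil
}
```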
/lgtm
I only really have one minor comment, but it doesn't block this PR, so feel free to cancel the hold if you don't want to address it now.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: artemvmin, x13n. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
Comments addressed. /unhold
Thanks! /lgtm
Force-pushed 5892f72 to 9ea5a36
/lgtm
What type of PR is this?
/kind feature
What this PR does / why we need it:
This is one of several CLs to move all drain conditions to drainability rules. Once complete, clients implementing custom drainability rules will have full control over the scale-down of nodes.
Notable changes:
- simulation/drainability: The Rule.Drainable() function now takes a DrainContext. This function can assume that DrainContext is not nil. (An illustrative sketch of this shape follows at the end of this description.)
- simulation: NodeDeleteOptions has been split into two structs: simulation/options:NodeDeleteOptions and drainability/rules:Rules. Consumers of this struct have been updated accordingly.
Does this PR introduce a user-facing change?
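To illustrate the shape described in the first bullet, here is a hedged, illustrative-only sketch of a custom rule receiving a DrainContext. The type names are simplified stand-ins rather than the actual cluster-autoscaler/simulator/drainability definitions, and the local-storage rule is a made-up example.

```go
package example

import (
	"time"

	apiv1 "k8s.io/api/core/v1"
	policyv1 "k8s.io/api/policy/v1"
)

// drainContext is a stand-in for the DrainContext handed to every rule; per
// the description above, rules may assume it is never nil.
type drainContext struct {
	Pdbs      []*policyv1.PodDisruptionBudget
	Timestamp time.Time
}

// rule is a stand-in for the drainability Rule interface: given the drain
// context and a pod, it decides whether that pod may be drained.
type rule interface {
	Drainable(drainCtx *drainContext, pod *apiv1.Pod) bool
}

// blockLocalStorage is a toy custom rule: pods mounting emptyDir volumes are
// treated as not drainable, so their node is not scaled down.
type blockLocalStorage struct{}

var _ rule = blockLocalStorage{}

func (blockLocalStorage) Drainable(drainCtx *drainContext, pod *apiv1.Pod) bool {
	for _, vol := range pod.Spec.Volumes {
		if vol.EmptyDir != nil {
			return false
		}
	}
	return true
}
```

A custom rule of this kind would be plugged in alongside the defaults via the drainability/rules:Rules list mentioned in the second bullet, which is what gives clients full control over scale-down decisions.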